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Abstract 

A Lie-algebra based recipe for smoothing gauge links in lattice field theory is presented, 
0^ ■ building on the matrix logarithm. With or without hypercubic nesting, this LOG/HYL 

smearing yields fat links which are differ entiable w.r.t. the original ones. This is essential 
£S) | for defining UV- filtered ( "fat link" ) fermion actions which may be simulated with a HMC- 

type algorithm. The effect of this smearing on the distribution of plaquettes and on the 
residual mass of tree- level (9(a)-improved clover fermions in quenched QCD is studied. 



On: 



1 Introduction 



In lattice field theory one is generally interested in taking the continuum limit where the lattice 
spacing a tends to zero. In other words, one calculates the ratio R(o) of two physical masses or 
matrix elements and considers the limit where the correlation length in lattice units diverges, 
£/a — > oo. Since the physical box volume is supposed to be roughly constant, the total number 
of variables (and hence the CPU time needed) grows with a large power of a -1 . 
O To speed up the computation of the continuum ratio two concepts have proven important. 

The first one is known as Symanzik improvement pQ. Here one augments both the action and 
the observable by irrelevant terms. Upon tuning some coefficients it may be achieved that 
! the continuum value is approached at a quadratic rate, R(a) = R(0) + const (a/r ) 2 + 0(a 3 ), 
rather than linearly, R(a) = R(0) + const a/r + 0(a 2 ), where r is a fixed length. Hence, if the 
Symanzik scaling window (the regime where the "const" term dominates) starts at a~0.1fm 
and one wishes to cover three quarters of the variable in which one extrapolates linearly, the 
finest lattice will have a ~ 0.05 fm with improvement versus a ~ 0.025 fm without. 

The second ingredient in todays state-of-the-art simulations is some damping of ultra-violet 
(UV) fluctuations, that is of unphysical excitations at the scale of the cut-off a -1 . To maintain 
locality, such damping is achieved via locally averaging field variables. In case of a gauge theory 
one replaces the link U^(x), the parallel transporter from x+jl to x, by U'(x). The latter is 
some average of paths, attached to the same endpoints, which stay within a local neighborhood 
of x and x+fi. From a formal viewpoint such smearing amounts to another change of the action 
by irrelevant terms; thus changing "const" in the extrapolation law, but not the power (i.e. 
the Symanzik class is unchanged). Data suggest that some mild smearing enlarges the scaling 
window and reduces the "const" (in absolute magnitude) in the extrapolation law [21 [3]. 

Specifically for lattice QCD it has been observed that these two strategies, when applied 
together, enhance each others effectiveness. With staggered quarks this has been demonstrated 
in a somewhat indirect manner. Upon combining a Naik term with a specific UV-filtering 
known as "AsqTad" smearing the MILC collaboration has been able to simulate large volumes 
with standard lattice spacings and fairly light pion masses [I] . For fat-link clover quarks [2] the 
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enhancement has been investigated in detail in the quenched approximation [5], and there is a 
similar program for large-scale dynamical simulations with almost-realistic quark masses [Bj. 

What remains is an engineering issue. With the smearing being part of the action or operator 
definition, it seems natural to keep the parameter a and the iteration level n iter unchanged as 
the lattice spacing is reduced. This way locality is guaranteed, and (a, n iteT ) represents a handle 
to influence the overall cost, while the ratio R(0) is, ultimately, unchanged. Of course one may 
ask what is the "optimal" choice of (a, niter); m terms of CPU time, to achieve a predefined 
accuracy of R(0). In practice, however, one is interested in the continuum limit for a number 
of quantities, and often this list enlarges as the simulation program progresses. Thus, staying 
away from either extreme choice (i.e. no or excessive smearing) will often be sufficient. 

For theories with dynamical fermions an additional engineering constraint emerges. There is 
a single algorithm which scales, at fixed bare parameters, almost linearly with the box volume, 
known as hybrid Monte Carlo algorithm (HMC), with subvarieties called PHMC, RHMC [7]. 
This algorithm demands that the fermion action reacts smoothly to a change of the field of gauge 
variables U^(x). For fat-link actions this requires that the smeared link U' (x) be differentiable 
with respect to U^x), which amounts to a constraint on the smearing recipe. 

Historically, the first smoothing introduced has been APE smearing [8] 

f/ APE (x) = P sm {(l- a )U„(x) + -^- £ U^U^x+vpitx+P)} (1) 
= P sm {(l- a )I+-^-r J2 U u (x)U,(x + v)Ui(x+i,)Ul(x)}u,(x) (2) 

with Pq denoting the projection to the gauge group G. The rewriting in the second line is 
based on Psu(3){AU} = Psu(3){A}U , valid for the combination of an arbitrary matrix A and a 
special unitary U. Since the U(l) projection [cf. f|49| 1501) below for details] creates a headache 
in the HMC force, Morningstar and Peardon invented the "stout" smearing [9] 

U* xp (x) = exp (| J2 {[U„(x)U^x+v)Ut(x+fl)Ul(x) - h.c] - V.]}) U,(x) (3) 

subsequently dubbed EXP, which, by design, yields differentiable fat links. However, it turns 
out that (jBJ) is less effective in damping extremal plaquettes than (j2J). In Ref. [10] a modified 
nAPE smearing has been introduced where the projection is to U(3) only. The goal of this 
article is to test yet another smearing which yields an SU(3) valued differentiable link, with an 
efficient damping of the UV fluctuations, such that it could be used in full QCD. 

This article is organized as follows. The next two sections specify the new LOG smearing and 
show how it can be used together with the hypercubic nesting trick, defining the HYL smearing. 
Sections 4 and 5 investigate the impact on selected observables, in particular the distribution 
of plaquettes and the residual mass of fat-clover fermions. In the latter case evidence is given 
that M n ~ 160 MeV can be reached at standard /5-values in 0(a)-improved quenched QCD 
without hitting the so-called "exceptional configuration" problem. Sections 6 and 7 sketch how 
the LOG/HYL smearing is included in a HMC approach to full QCD and how the matrix 
logarithm may be used to define a new gauge action and the pertinent (gluonic) topological 
charge density. After a summary, three appendices give technical details. 
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2 Logarithmic link smearing 



The re- written form of the APE smearing contains the factor Psu(3){-} by which the original 
link U^ix) is multiplied. Upon sending the parameter a— >0 this prefactor is smoothly deformed 
into unity. If one wishes to stay in the group, the backprojection is needed, since the weighted 
arithmetic average of the identity and six (in 4D) "blades" (closed staples) is not a group 
element any more. Given the standard definition of the fractional power of a matrix 

U a = exp(alog(E7)) (4) 

one starts wondering whether it would make sense to define the smeared link as 

U » = ex P (oTTTTT £ \og[U y {x)U,{x+0)Ul{x+fi)Ul{x)]) U,{x) (5) 

and on a sufficiently smooth gauge background the rationale for this choice is as follows. The 
product U v (x)U fl (x+v)Ul(x+ fijU^x) is special unitary, its logarithm thus anti-hermitean and 
traceless. So is the sum, and this means that the exponential defines again a special unitary 
matrix which is to be multiplied onto the original link. Hence one stays in 577(3). 

The problem is that the argument has a loophole: On an arbitrary gauge configuration the 
logarithm is not traceless, it is just traceless modulo 2ni. Upon averaging this logarithm with 
the remaining five (in 4D) contributions, which we assume to be traceless, one gets an arbitrary 
trace. As a result, the exponential is still unitary, but its determinant is not 1, so one leaves the 
gauge group. While, even on a rough configuration, Tr \og[U u (x)U^(x + i>)Ul(x + ^U^x)] ^ 
is rare, if one wishes to stay in S77(3), this possibility needs to be accounted for. 

Obviously, by restricting the sum to traceless contributions, the goal of staying in the gauge 
group is reached. The most straightforward option is to define the smearing through 

U;{x) = exp E [l^pu(x)U^x+u)Ut(x+ii)Ul(x)] - ^Trlog[.]}) U,{x) (6) 

where, as mentioned before, the new term — |Trlog[.] is almost always zero. In tangent space 
the logarithm of the 4-link product may be decomposed as log[.] = J2a=i i£ a T a + i£ 9 7 with 
f 1 G R for a = 1, 8 and £ 9 G {0, ±2tt/3}. Here, T a = X a /2 with A a the Gell-Mann matrices. 
In ([6]) the 9th (radial) component is simply projected away, and after this correction, the six 
contributions to the tangent space are again subject to an arithmetic average. 

A more general approach may allow for unequal weights of the six staples, depending on 
whether their logarithm is traceless or not. Hence, a more sophisticated version is 

U';(x) = exp (— ^— J2 c ±u {\og[U u (x)U,(x + v)Ui(x+f,)Ul(x)} - ^Trlog[.]}) U,(x) (7) 

where the coefficient c± u is a function of log[f/ !/ (x)?7 Ai (x-|-z>)?7j(x-|-/i)Z7^(x)], for instance c v = +1 
if Tr log[.] = and c v = —0.5 if Tr log[.] = ±27ri. Still, in view of the application as an ingredient 
in a HMC, this definition seems less attractive, since it entails a rather complicated force. 

Finally, one may choose to invoke a non-principal logarithm. In fact, thinking in terms of 
the eigenvalues Aj of the original matrix (which we assume to be special unitary) it is natural to 
attribute a possible occurrence of Tr log[.] = +27ri to the Aj with the largest imaginary part of 
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log(Aj) (and ditto to the one with the smallest imaginary part of log(Aj) for Trlog[.] = — 2iri). 
Accordingly, a "trace-free" matrix logarithm may be defined which basically shifts the logarithm 
of one eigenvalue of the original matrix by ±27ri in those cases where the principal logarithm 
has non-zero trace. With this function at hand, we define 

Uj: OG (x) = exp (^tt E tnog[U u (x)U,(x+u)Ut(x+ix)Ul(x)}) U,(x) (8) 

and below, whenever referring to the LOG recipe without specification, the recipe (jSJ) will be 
meant. This "trace-free" logarithm seems most interesting, because in full QCD this version 
amounts to a smooth adaptation of the definition to the actual gauge field, which has a good 
chance to avoid large HMC forces. For implementation details of all varieties see App. A. 

The construct (jHJ) retains all symmetry properties of the original link, in particular the 
behavior under gauge transformations, charge conjugation, reflections, and permutations of the 
coordinate axes. It produces a link in SU(3), with an obvious generalization to SU(N C ), and 
the new U^f^^^x) is differentiable with respect to the original U^{x). The normalization of 
the parameter a, which determines the weight of the fluctuation, has been chosen such that 
in leading order perturbation theory the new LOG recipe (jHJ) agrees with the APE recipe (J2]). 
The same holds true for the stout/EXP recipe ([3]) if an extra factor l/(2(d— 1)) is included [5]. 

3 Hypercubic nesting trick 

To tame the noise in QCD observables, one would like to iterate the smearing, while keeping the 
delocalizing effect minimal. A nice strategy (inspired by the fixed-point action approach and 
the pertinent "perfect smearing") was presented in In this original form the hypercubic 
nesting trick uses the APE smearing (j2J) as core recipe, giving 

Psu(3){ (l-ct 3 )/+ y E U (7 (x)U fl (x + a)Ul(x+fi)Ul(x)}U fl (x) 

Psu(3){(l-ai)I + ^ E V^(x)V^(x+0)Vl^x+pL)Ul(x)}u^ (9) 

where we stick to the notation of [11] in which oti denotes the fluctuation weight in the last step. 
This has been generalized to the case of "stout/EXP" smearing [5] and in complete analogy 

Wx)=exp(^ ]T mog[U a {x)U^x+a)Ul{x+jl)Ul{x)])U,{x) 

%, u {x) = exp J2 ^og^W^P^+^^V^+A)^^)])^^) 
±p+pv 

U* YL (x) = expQ ]T tflog[K, M (z) V^(x+u) Vj^z+pL) ^(x)])^(x) (10) 

is the hypercubically nested version of LOG smearing. With these formulae given, all aspects 
of logarithmic link smearing have been specified and we are ready for a numerical investigation. 
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Figure 1: Distribution of the plaquette on a 32 4 lattice at [3 = 5.8 (top) and (3 = 6.2 (bottom) 
after one step of APE /EXP /LOG smearing (left) or HYP/HEX/HYL smearing (right). To the 
right of the dashed line at s = 4/3 the non-principal definition of the log may prove relevant. 
Throughout perturbative equivalents of a ape = 0.6 and gjhyp = (0.75, 0.6, 0.3) have been used. 



4 Effect on selected gluonic observables 

The LOG/HYL smearing (JSX HDD rriay be used both in gluonic observables (e.g. for a fat-link 
topological charge or Polyakov loop) and in fermionic observables (e.g. for a fat-link Dirac 
operator). This section is devoted to the effect on one quantity in the pure gauge theory, the 
distribution of the plaquette. The comparison between the different smearing recipes will be 
organized both in an unfair manner (sticking to the perturbative equivalence rule described in 
Sect. 2) and in a fair way (using optimized parameters in each recipe). 

Fig.[T]shows the distribution of the plaquette variable s^ v (x) = 1 — ReTr(?7 M ^(x))/3 on a 32 4 
grid in a pure gauge ensemble with the Wilson action at (3 = 5.8 and (3 = 6.2. The distribution 
vaguely resembles a black body radiation curve - it starts out with a power law and ends with 
a more-or-less exponential tail which, at not-so-large (3 is affected by the fact that s < 1.5. 
Smearing shifts the distribution towards a "colder" temperature, in particular the probability 
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32 4 , p=5.8, perturbative equivalents of a =0.6 32 4 , p=5.8, pert, equivalents of a =(0.75,0.6,0.3) 





Figure 2: The same as in Fig. 1, but after three iterations of the smearing have been applied. 



of large or extremal plaquettes is suppressed. Taking APE smearing as a standard, EXP (alias 
"stout") is somewhat less effective, while the new LOG smearing is at least as effective. In this 
comparison perturbatively equivalent parameters have been used, that is oape = Q^log = 0.6 
and oexp = 0.1 (see [5] for details). In the right panel a similar comparison is given for 
one step of HYP, HEX or HYL smearing. Again the perturbatively equivalent parameter set 
c*hyp = «hyl = (0.75,0.6,0.3) and «hex = (0.125,0.15,0.15) has been used, and with this 
specific choice HYP and HYL are more efficient than HEX. The two lower panels show that the 
difference between the smearings is reduced as /3 increases, in particular with the hypercubic 
nesting trick in place. 

Fig. [2] shows the same distributions as Fig.HJ but this time after three iterations of the 
smearing. Again, APE and LOG are more effective than EXP, and HYP and HYL are better 
than HEX, albeit the difference seems less pronounced than in the previous figure. 

In these figures the "trace-free" logarithm has been used. It turns out that the principal 
logarithm (jSJ) yields an even better suppression of extremal plaquettes, but for reasons discussed 
in Sect. 2 the "trace-free" version seems more promising for an application in full QCD, and I 
have chosen to consistently show the results for this non-principal logarithm. 
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Figure 3: Average action s (left) and median-plusSa-percentile of s (right) at [3 = 5.8,6.0,6.2 
(from top to bottom) versus smearing parameter a, after 1 iteration of APE/EXP/LOG 
smoothing. To make the three a commensurate, the perturbative equivalence rule has been 
used, but the 1 EXP data have also been redrawn as a function of a e g as defined in ( ITTj) . 
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Figure 4: Same as Fig.Q except that the effect of 3 iterations of APE/EXP/LOG smoothing 
is shown. The 3 EXP data have been redrawn as a function of a e g, as defined in (TTTj) . with (P) 
denoting the plaquette of the original (unsmeared) configuration. 
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Evidently, a fair comparison should use "optimized" values for each smearing. However, 
it is not so clear which quantity sould be used in the optimization. For instance, statements 
about the effect of smearing on both the average plaquette and extremal plaquettes are found 
in the literature [TT]. Moreover, once a specific quantity to optimize for has been selected, at 
least two characteristics should enter the comparison: 

(i) The value of the selected observable in the extremum. 

(ii) The width of the "near-minimal" region, i.e. whether the optimal smearing effect requires 
a fine-tuning of the parameters or whether any "near-optimal" choice will do fine, too. 

To address these questions a scan over different smearing parameters in the APE/EXP/LOG 
recipes has been performed. The focus has been on the average plaquette s avg and an "extremal" 
plaquette, here defined through s + 3 a which is meant to denote the 0.99865 percentile, rather 
than s max which is rather noisy. Data for (3 = 5.8,6.0,6.2 are collected in Figs. [3] and H] for 1 
and 3 iterations, respectively. All results are plotted as a function of a ape, 6aE X p, «log; but 
following the suggestion of [TDj the EXP/stout data are also plotted as a function of 

acS ~ l + 6a EX p(l-<P» (U) 

where (P) denotes the plaquette (i.e. 1 — s) of the original (unsmeared) configuration. 

Regarding point (i), from Fig. [3] it is evident that the minimum of the average plaquette 
action is almost the same with all three smearings. Only the position at which it is assumed is 
somewhat different with EXP (it seems a ~ 0.7 is optimal for APE and LOG, and 6a ~ 1 is a 
good choice for EXP), but most of this effect is gone if the latter data are plotted as a function 
of a e Q [TO] • On the other hand, the "extremal" plaquette is quite sensitive to the details of the 
smearing recipe (right panel). With this criterion LOG smearing is significantly better than 
APE, and the latter is slightly more effective than EXP. However, this advantage diminishes 
towards large f3, and also for higher iteration counts as Fig.@] shows. 

Regarding point (ii), we learn from Fig.[3] (right panel) that again LOG smearing has a 
slight advantage over the other two in showing a broader shape in the "near-minimal" region. 
In other words, this recipe seems more robust w.r.t. the details of the smearing parameter, 
in particular if an observable is chosen which is sensitive to near-extremal plaquettes. Again, 
larger (3- values and higher iteration counts tend to diminish the effect. 



5 Effect on selected fermionic observables 

The main rationale for the LOG/HYL smearing fISl fTU]) has been to create a smearing that may 
be used as an ingredient in a fermion action suitable for full QCD simulations with the HMC 
algorithm. Therefore it is important to test a fat-link fermion with this kind of smearing and 
to compare it to similar actions where APE/HYP or EXP/HEX smearing has been used. This 
may be done with Wilson or with staggered quarks. In the former case the focus should be 
on chiral symmetry breaking, in the latter case on taste symmetry violation. For definiteness, 
I concentrate on Wilson fermions (with a clover term) and I choose the simplest observable 
sensitive to chiral symmetry violation: I measure the residual mass m res , defined as the PCAC 
mass at bare mass m = 0. Lacking the CPU power needed to simulate full QCD, I choose the 
quenched theory as testing ground. In fact, from an engineering viewpoint this choice is likely 
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Table 1: Residual mass am res , defined as the PC AC quark mass at zero bare mass, for various 
couplings and smearings. The box geometry is L 3 xT with T = 2L, and errors are statistical. 



to be more indicative of the quality of the smearing, since there is no determinant which would 
mitigate the effect of spurious almost-zero modes. 

With standard conventions the (r — 1) Wilson operator takes the form 

D w (x, y) = - {(7 M - rjU^xjSs+fry - (7 M + /)^£0 - £)4-a,2/} + l^Kv ( 12 ) 

with I a 4 x 4 spinor matrix and 1/(2k) = 4 + m . The Sheikholeslami-Wohlert clover operator 
follows by subtracting a hermitean contribution proportional to the gauge field strength [T2] 

Dsw{x, y) = D w (x, y) - a ^ F ^ $x, y (13) 

with crfj,u — k{lij,ilu\ an d Ffiu the hermitean clover-leaf operator. The same kind of UV-filtering 
is applied to the covariant derivative and to the clover term. In other words, the gauge link 
U^ix) in (TT21) is replaced, for instance, by U^ OG (x) and the field-strength tensor F^ u (x) in (fT3l) 
is built from such links, too (see [3] for references to other options). Throughout, the tree- level 
improvement coefficient csw = 1 is used. The goal is to test the smearings and to compare the 
non-perturbative data to the perturbative 1-loop prediction for m ies . 

The details of the gauge configurations can be read off from the heading of Tab.[TJ The box 
geometry is L 3 xT with T = 2L. The five couplings between (3 = 5.6 and (3 = 6.4 and the grid 
sizes have been chosen such that the physical box volumes are approximately equal, aiming for 
L/ro — 3 by the formula for ro(/3) from [13] . The values in the a _1 [GeV] line are based on the 
assumption ro = 0.5 fm and indicate that the lattice spacing varies by a factor 4. A point- or 
U(l) wall source has been used, and the point-sink has been averaged over the time-slice. The 
inversion of the clover operator has been performed with an even-odd preconditioned version 
of the biconjugate gradient 75 algorithm (BCG75), with details given in App.B. 
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Figure 5: Individual and averaged pion correlators at f3 = 6.4 and amo = with 3 LOG steps, 
together with resulting pion mass and PCAC mass (everything in lattice units). 



An illustration of the observables studied is given in Fig.[5l based on the run at /3 — 6.4 
with the 3 LOG action. Using both ipls^ an d V , 747sV ; as interpolating fields the direct and 
crossed correlators have been calculated, labeled as "PP", "AA" and "PA/AP", respectively. 
The individual correlators and the pertinent ensemble averages are shown in the first and second 
panel. The third plot indicates the resulting effective masses for M n . The pion mass seems to 
be around M v ~ 0.17a" 1 = 650 MeV, but the box is not sufficiently long to see a good plateau. 
Finally, the fourth plot contains the plateau of the PCAC mass, the latter being defined as 

(d 4 +dt)(A(x)P(0)) 
a ~ A(P(x)P(0)) ' 1 j 

In Fig. [6] the same observables are shown (again at (3 = 6.4) for the action with 3HYL steps. 
Evidently, with amo — kept fixed, the pion is much lighter now. By comparing the first 
two plots to their counterparts in Fig. [5] one sees that the correlators are much fuzzier now. In 
consequence, the effective mass plot for M n does not show a good plateau at all. Luckily, the 
PCAC mass is still easy to determine, and this is sufficient for the present investigation. 
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In Tab.[T]the values of the residual mass am PCAC at am = are summarized for 13 fermion 
actions at 5 couplings. The same (perturbatively equivalent) smearing parameters have been 
used as in Figs. [I] and [2J since we want to match on a common perturbative prediction (see 
below); results should not be interpreted as to give a fair comparison of the smearing recipes 
per se. Whether actions with n^er = 3... 7 suffer from "strong derealization" effects (read: have 
bad scaling properties in some observables) is, at this moment, not clear. In state-of-the-art 
phenomenological studies one has several lattice spacings and thus means to check. The 7 HYL 
line has been included to indicate that there is no sign of a saturation of am res versus n- lter in 
the range studied. The main message from Tab. [I] is rather encouraging; the LOG smearing 
defines a clover action with surprisingly good chiral symmetry properties. So far residual masses 
am res ~10~ 3 have only been achieved with domain-wall fermions [H] . 

With Tab. [TJ the main test has been completed, but it is still interesting to compare the data 
to the prediction from 1-loop fat-link perturbation theory. To lowest order the residual mass 
am res relates to the frequently quoted critical mass am cr i t through 

m res = — — — (15) 

6 A 
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Figure 7: Graphical representation of the results of Tab. [I] for ^iter = 1 without (left) and with 
(right) hypercubic nesting. The common 1-loop prediction for each panel is shown as a straight 
dash-dotted line and is used as a constraint in fitting the data with j3 > 5.8 to the ansatz (T, 





Co 


C\ 


c 2 


c 3 


1APE 


[4.90876] 


0.40140 


-0.95201 


-0.90361 


1HYP 


[1.98381] 


-0.05780 


-0.68221 


-0.92911 


1EXP 


[4.90876] 


1.15872 


-1.51175 


-0.89583 


1HEX 


[1.98381] 


-0.03521 


-0.43517 


-0.91397 


1LOG 


[4.90876] 


0.39395 


-0.95340 


-0.90421 


1HYL 


[1.98381] 


-0.09322 


-0.65284 


-0.92974 


3 APE 


[0.77096] 


0.27440 


-0.74716 


-0.92960 


3 EXP 


[0.77096] 


0.64259 


-0.90602 


-0.92456 


3 LOG 


[0.77096] 


0.25769 


-0.72616 


-0.92955 



Table 2: Summary of the coefficients in the fit nT7\) for the 9 actions where Cq is known in PT. 



but the perturbative expansion Zx = l+O (g^) means that at leading order there is no difference 
between am res and am cr i t . Thus we may directly take the result from [5] for our parameters 



2 2 f 4.90876 (1APE/EXP/LOG, c sw = l) 

h n c % ' 

"0/7J — " 

16vr 2 12tt 2 



am res = -^C F S = 1.98381 (1 HYP/HEX/HYL, c sw = 1) (16) 



0.77096 (3 APE/EXP/LOG, c sw = l) 

and confront it with the data. In Fig. [7] the data with n iteT = 1 are plotted against g\ = 6/j3. 
Since we are outside the regime where (TIB"]) is adequate it is natural to use the rational ansatz 



am res = ——: c — 2 ( 17 ) 

12tt 2 1 + c 3 g$ 



to fit the data, where the perturbative constraint (TlBj) is built in by setting Co = S. The results 
of the fits to the data at /3 > 5.8 are summarized in Tab. [2] With these values in hand, the 
residual mass for any of these 9 actions can be accurately predicted in the range (3 > 5.8. 

Finally it is interesting to convert the residual masses into physical units and to (MS, 2 GeV) 
conventions. In the first step we use again the formula from [13] ; the result is shown in Tab. El 
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(ft T 1 n\ 

(p, L/a) 


^0.0, UoJ [0.0,1/,) ^O.U, 10J [O.ZjZU) ^0.4, ZoJ 


ILOG 

1 HVT 
1 rl 1 Li 


0.9914(46) 0.9421(41) 1.0223(34) 1.1584(32) 1.3203(28) 

U.44ZD^oUJ U.o4U0^oUJ U.oZ04^/0J V.oooil^l I ) U.4Uoy(l(J 


3HYL 


U.4000^00J U.0140^00J U.Z04Z^OUJ U.ZOOZ^ZUJ u.ziy4^zyj 

0.1601(25) 0.0906(33) 0.0590(36) 0.0502(19) 0.0515(14) 


7HYL 


0.0698(18) 0.0401(22) 0.0215(25) 0.0150(18) 0.0135(14) 



Table 3: Same as Tab. [21 but converted to tq units, based on the formula for r^/a given in L 13]. 



(fi, L/a) 


(5.6,08) (5.8,12) (6.0,16) (6.2,20) (6.4,28) 


ILOG 
IHYL 


388(18) 381(17) 424(14) 489(13) 565(12) 
165(11) 131(12) 129(10) 144(07) 166(07) 


3 LOG 


166(12) 119(13) 103(12) 105(08) 113(11) 



Table 4: Same as Tab. [31 but converted to MeV and (MS, /i = 2 GeVj, based on Z m from / fT5)) . 



Assuming r = 0.5 fm, the second step is performed by means of m MS (fi) = Z m (a/x)m PCAC (a r ), 
and we use the 1-loop perturbative prediction [as usual, up to O(g^) contributions] 

Zm = Z? = (l - |yf - log(aV 2 )])" = 1 + J,[f - M«V)] (18) 

with the ingredients z s = 4.11106 for 1 APE/EXP/LOG, z s = -1.43930 for 3 APE/EXP/LOG 
and zs = —0.03678 for 1 HYP/HEX/HYL from [5]. The results are shown in Tab.Hl Assuming 
that M n = 140 MeV corresponds to m MS (2GeV) = 4 MeV we can now estimate the pion mass 
by multiplying 140 MeV with the square-root of m MS (2 GeV)/(4MeV). In fact, since Z m is so 
close to 1, even an estimate based on the bare PCAC mass might be accurate to 10% or so. 
In case of the 3 HYL action it predicts M„ ~ 320 MeV for f3 = 6.4, in fair agreement with the 
result from the effective mass plot. In case of the 7 HYL action the effective mass plot cannot 
be used as a check, but the estimate M n ~ 160 MeV might still be accurate within 20 MeV. 
Note finally that - with mo = or k — 1/8 fixed - all of this has been achieved in a regime (to 
be precise: at the edge of the regime) where we are safe against "exceptional configurations". 

6 Towards full QCD with the RHMC algorithm 

With the tests of the LOG/HYL smearing in the quenched case completed, it seems worth 
while to spend a first thought one how one would use this smearing in full QCD. 

The driving ingredient in a HMC algorithm is the molecular dynamics evolution which is 
determined by the so-called HMC force. Let D m be an arbitrary undoubled fermion action, 
implicitly dependent on the "thin" gauge links U. The pseudofermion action is defined as 

S pi = (0| M~ 1 ' 2 |0) = J cj>\x) M~ x l\x, y) <p(y) d A x d\j , M — D] n D m > (19) 

where <j) denotes a boson field with the spinor and color components of a standard Dirac flavor. 
We choose Nf — 1 for full generality; the version for even Nf is even simpler. The HMC force, 
finally, is defined as minus the derivative of 5 p f with respect to the thin links [7j. The standard 
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way to proceed is to make use of the fact that the p-th order diagonal rational approximation 
of x~ l l 2 over the relevant spectral range admits a partial fraction formulation [7] 



x 



fc =i x + Pk 

with ak>0 (k — l...p) and < /?i < ... </3 p . As a result, the pseudo-fermion force is given by 

^ P f = S'jt = X>* (0| (M + ft)" 1 M' (M + f3 k )~ l |0) (21) 
fc=i 

where the prime denotes the variation w.r.t. a single element of the gauge field A a Ax-\-fl/2), 
defined as a Gell-Mann component of \og(U ^{x)) . In other words, what is needed to work out 
S' pi is the derivative of M w.r.t. the gauge field. In explicit terms this is 

M ' = SF = + D <»S? = + (22) 

MM M 

where the derivatives (D^)' and (D m )' follow as a product of the variation of the chosen fermion 
action under a change of the "fat" links, times the inner derivative [IS]. In the very end, the 
variation dS p ,/dA°(x +i i) combines both typefl of derivatives. The former is standard, except 
that in the result a replacement U — > JJ LOG is needed. The latter is specific to the chosen 
smearing recipe and hence deserves a closer look. 

We restrict ourselves to the "trace- free" logarithm (jSJ). The product rule (jSlj) yields 



a 



+ J 3® ex P(^ri) E tflog[C^(x)C^(x+p)C^( a r+/z)C^(x)]) • J 9 ^,A,y (23) 



±P^M 



^Jtf^y = ( ^ (x)/ ® /3) '^){ eXP (^l) E tflogf^^^Cx+^C/t^+^C/t^)])^) 

and analogously for the daggered fat-links. Here we use the tensor notation where the l.h.s. is 
a 9 x 9 matrix with dU^ OG (x)ij/dU v (y)ki in the 3(j — l)-M-th row and 3(1 — l)+A;-th column. 
In other words, U^ OG (x) and U u (y) are "reshaped" into 9x1 column vectors, and then the 
derivative of the p-th element of the first with respect to the q-th element of the second appears 
in position (p,q). Likewise, on the r.h.s. the derivative is a 9 x 9 matrix, and U^x) has been 
"blown up" by tensoring with the identity such that the two 9x9 matrices would multiply 
according to standard matrix multiplication rules. Some details of the underlying formalism 



1 We use the standard definition (with U denoting a generic fat link) 

d(.) = c>(.) d(.) dU v d(.) dUt d{.) = d(.) dU v d{.) dUj d(.) = d(.) dU v d{.) dfJl 
dA» OA- dU v dA- + d ut dA* ' dU» dU u dU^ + QUt dU^ ' du}, dU v duj, dill 



in which the partial derivative w.r.t. a given link picks up only that link and not its hermitean conjugate, that 
is and are considered independent with regard to partial differentiation. 
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have been collected in App. C. The task is now to collect the various pieces that come from the 
derivative of the exponential. Upon applying the chain rule (1651) for matrix functions, I obtain 

^7^{ ex P(^7^£ tf MW = F 1 -F 2 -F 3 (25) 

^^{exp(^^£tflog[^ = F 1 -F 2 -F 4 (26) 
where each factor 

Fi = 9e j(* )C evaluated at X = £ tflog[^(x)^(x + p)^( x+/i ) f/ t (a;)] 



a _ atflog(X) 



2(d- 



TT £ evaluated at X = U p (x)U li (x+p)l/J ) (x+p l )U^(x) 



f s = Pl(x+mi(x) Y 9 i,. m<^m. 

= [U^x+p)Ul(x+fi)Ul(x)]'®I 3 6 U)P S x>y + [Ul{x+(i)Ul(x)]'®U p { 

f 4 = / 3 «8>[^(x)^(x+p)]- 

= E^(^y® [E/ p (a;)C^(a:+p)] ^,A'+A,s/ + h®[U p (x)Up(x+p)Ul(x+p)\ 5^S x>y 

is a 9 x 9 matrix, and similarly for the daggered fat-links. If the LOG smearing is iterated, 
several factors (I2"5l |2"4"|) are lined up, each one evaluated with the appropriate argument. 

For completeness, let us consider fat-link clover fermions, although this has been done before 
[T5] . Introducing, for each term in (12T|) . \r]k) = (M + 1 10) and \(k) = D m \rjk) we have 

= -^f = ib a *[(Vk\(nly\( k ) + (Ck\(Dj\ Vk )} (27) 
fc=i 

where we enjoy the benefits of dealing with a scalar function of a scalar argument (such that 
the full apparatus of App. C is not needed). From a glimpse at (TP2]) it follows that 



J2 «fc Tr s P in Vk( x + ^)(lp + J )Ck(x) - Cl0)(7,u - I)Vk(x+ft) (28) 



where the layout of the color structure on the r.h.s. is just adapted to whichever convention on 
the l.h.s. is chosen. Similarly, from a glimpse at f|T3|) it follows that 



dS'pf sw c sw ^ r t <9F A (y) t <9F A (y) -I 

^pG^y - T"?g a *^b (y) ^st^G^ c(y) + c ^^^^^J • (29) 

Upon putting the various expressions together, one has an analytic expression for the HMC 
force of a LOG-filtered clover (staggered, overlap, etc.) action. The generalization to the HYL 
smearing (jTUl) follows by lining up several factors of (I23| 124)) [with restricted sums] . 
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7 Modified gauge action and topological charge density 



In the numerical investigations of this note the traditional Wilson gauge action 

So = £ {l - ^ReTr(^(x))} (30) 

9o x,n<v iV c 

has been used. However, with the technical means to compute the matrix logarithm of the 
plaquette U^ u (x) = U p ,(x)U v (x+ ji)UUx+u)Ul(x) in hand, two modifications are possible. 

The first one concerns a constraint, to be added to whichever gauge action is used. Clearly, 
the discussion around (jHl-El) would have been much simpler, if we could be sure that there is 
no lattice in our ensemble with a single U pil/ (x) which has a non-trace- free principal logarithm. 
It can be shown that a sufficient condition for Tr\og(U^ u (x)) = is KeTr(U /Jil ,(x)) > —1. This 
suggests adding a piece which penalizes all plaquettes with Re Tr(U pu (x)) < 0, without affecting 
those with ReTr(U pu (x)) > 0. Alternatively, one may directly test for Tr\og(U til/ (x)) = ±2iri 
and assign, in such prohibitive extra action. This point is summarized through 

s _^ s f +7E^<^(-r)exp(l/r) with r = Re Tr(£^(x)) 
G ~~ ' G+ \ -iE^<,[Trlog([/^(x))] 2 with e«l. 

The second one concerns a modification of the "bulk" piece of the gauge action. Recalling 
that the rationale behind Wilson's choice (1301) is to come up with a simple recipe to produce a 
term F pi/ from U pu = exp(ig aF pu ) = l + ig a 2 F pu — 2#o° -^L — it is tempting to generate F pv 
directly from the logarithm of the plaquette and square it explicitly. With L pv {x) = tflog U pv {x) 
and L pv (x) = -[L^x) + L fll/ (x — jl) + L^ u (x — u) + L^x—pL — v)] at hand, it appears that 

S G [U] =a 4 ^[F^(x)F^(x)} = -\ ^[L pu (x)L pi/ (x)} (32) 

X,fl<U X,jJ,<U 

would be a rather natural choice. Likely this gauge action defines a theory with good scaling 
properties, good rotational symmetry and nice overlap with the perturbative regime, as it is 
rather similar to the Manton action |16j . On the other hand, tunneling of the topological charge 
needs to be investigated, and clearly (1321) is not as easy to simulate as the Wilson action (1301) . 
Of course, the same modification may be applied to FF, too. This yields the expression 

a 4 1 

= ^3 £ e^ pa Ti[F pv (x)F pa (x)} = - e^ pa Tr[L^(x)L pa (x)} (33) 

dZ7r x,fiupa 6Z7T 9o x,iwpa 

for the bare topological charge density. Unlike in the previous case, we do not expect this to 
be much better than the usual clover-leaf expression for the topological charge density, since in 
that case it is standard to perform an averaging of the anti-hermitean part of the plaquette. 
In the same spirit, it seems natural to define the clover action through 

Dssn{x, y) = D w (x, y) - ^ ( 34 ) 

Zl 9o p< v 

instead of fflBl [where F pv follows from the anti-hermitean part of the same 4 plaquettes which 
enter L pu \. Again, at least with some filtering, this change will be rather insignificant. 
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8 Summary and outlook 



The purpose of this paper has been to define and test a smearing which is applicable, in the 
context of the HMC algorithm, to studies of full QCD. The testing has been restricted to pure 
gauge theory observables and to clover fermions in quenched QCD, but it is clear that the main 
lesson concerns the fermion formulation per se. 

The key idea behind this LOG smearing is to do the "averaging" in the Lie algebra, such 
that one would end up with a smeared ("fat") link which is naturally in the original gauge 
group. This is a clear advantage if one intends to use such fat-links in observables which are 
sensitive to the structure of the gauge group (e.g. Polyakov loops), but it might also simplify 
calculations in fermionic systems (e.g. by providing a more direct connection to perturbation 
theory). Whenever a single LOG smearing (jSJ) is not enough, it seems advisable to use the 
hypercubic nesting to define the HYL smeared links (jlOj) . 

Tests in the pure gauge theory have shown that the new LOG smearing is at least as effective 
at suppressing UV fluctuations as known alternatives. From a practical viewpoint the LOG 
smearing seems to have an advantage in having a rather broad "near-optimal" region, i.e. good 
results do not require any fine-tuning of the smearing parameter. 

An important point is that the LOG/HYL smearing yields a fat-link which is differentiable 
with respect to the thin links it is built from. This makes it a useful ingredient in defining 
a UV-filtered fermion action which can be used to study full QCD with the HMC algorithm. 
The tests in the quenched theory have shown quite convincing results, though most of the 
good properties of such actions are not specific to the LOG/HYL recipes. These smearings are 
efficient in the sense that modest parameters and/or iteration counts lead to clover-fermions 
with quite acceptable chiral properties. It seems that, upon iterating the smearing, arbitrarily 
small residual masses can be attained. The smallest value found, am res = 0.0014(1) with 7 HYL 
steps at /3 — 6.4, is in a regime which, without smearing, can only be accessed by domain-wall 
fermions [Hj. It seems noteworthy that these results have been obtained with non- negative 
bare mass where one is immune against "exceptional configurations" . An obvious continuation 
of the current work would be to compare the scaling properties of such actions against mildly 
filtered or unfiltered ("thin link") varieties. From Fig.[6]one sees that the correlators of highly 
smeared actions at small quark mass become sensitive to individual instantons and/or other 
topological objects and get rather fuzzy. Likely, this is the flipside of any fermion action with 
good chiral properties. The present tests have focused on the residual mass, as this is the 
simplest observable sensitive to chiral symmetry breaking. Perhaps a study of the spectral gap 
of the hermitean Wilson/clover operator with LOG/HYL-filtering would be useful, too. 

Wilson fermions have been out of fashion for over a decade, since in the original formulation 
chiral symmetry is broken in a rather severe way, and standard Symanzik improvement alone 
does not bring a major change in this respect. It seems almost guaranteed that, upon combining 
a clover term with a modest amount of link-fattening (e.g. two HYL steps), the clover action 
is rendered a competitive choice for pushing towards M n = 140 MeV in full QCD. 
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A Matrix exponential, matrix logarithm and more 



Among the main concepts for computing matrix valued functions on the computer, viz. 

(i) eigenvalue and eigenfunction based routines 

(ii ) clever use of the Cayley-Hamilton theorem 
(Hi) iterative methods 

a combination of (i) and (ii) often yields the most efficient implementation. Nonetheless, it 
turns out that iterative methods - though originally designed to deal with medium-size full and 
huge sparse systems - represent a good choice already for 3x3 matrices, since they are easy to 
implement, numerically stable, and still reasonably fast in terms of CPU time. Below a robust 
implementation of the matrix exponential and the matrix logarithm is described. 

A.l Matrix exponential with iterative methods 

The matrix exponential of an arbitrary mxm matrix A is defined through (with A = I) 

exp(A) = £ — (35) 

k=0 

since this series has infinite convergence range. A general recipe to speed up the convergence 
is known as the "scaling and squaring" method. It is based on the trivial observation 

cxp(A) = [exp(A/2 n )] 2 " (36) 

which is paraphrased as "take exp(A/2) and square it, and nest this n times". The point is, of 
course, that A/2 has smaller eigenvalues (or singular values) than A, making the power series 
for A/2 converge faster than the one for A. Clearly, theorems may be derived for the optimal 
choice of n. However, in practice it is often sufficient to start with a reasonable n m i n , and to 
increase it until the result is stable. An implementation of the estimator E n (.), in which also 
the precision of the rational approximation R n (.) is gradually increased, reads 

E n . —\ — I 

"■mm A 

for n = n min : oo 

S n = A/2\ R n = [Ella U S nmZl n =o U- S nm k }-\ E n = {R n f n 
if \\E n - E n _i\\ < e exit 
end (37) 

where 1 1 . 1 1 is any matrix norm and e is usually chosen slightly larger than the machine precision. 
An aside: In a HMC algorithm one needs to calculate exp(A) for A e su(3). In order to guarantee 
reversibility, one demands that E(— A) is the exact inverse of E(A), even with a low grade E(.). 
Our representation R n (A) — [I + A/2 + ...][/ — A/2 + and thus E n (.), is of this form. 



A. 2 Matrix logarithm with iterative methods 

The matrix logarithm of a non-singular mxm matrix A is, in general, not unique. However, if 
A has no eigenvalues on the closed negative real axis, then there is one solution to the equation 
exp(X) = A for which all eigenvalues Xi of X satisfy — n < Im(xj) < 7r (i — 1, ...,m), and this 
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solution is called the principal logarithm of A and denoted log(A). Since there is no power 
series representation of log(A) which converges in the entire cut plane, the usefulness of an 
"inverse scaling and squaring" approach [T7], based on the identity 

log(A) = log(A 1/2 ") 2™ (38) 

is evident. It may be paraphrased as "take the square root, compute the logarithm, double it, 
and nest this n times" . Hence this approach leaves us with the problem to compute, with high 
precision, the principal square root of an arbitrary matrix A, but the gain is that the argument 
of the innermost ( "reduced" ) logarithm is, for large enough n, close to the identity. 

For the square root it is convenient to use an iterative method, too. The Newton iteration 

X k+1 = )^(X k + AX^) , X = A (39) 

with lim^^oo X k = A 1 / 2 has good theoretical properties, but its poor numerical stability renders 
it useless in practice [18] . Fortunately, the Denman-Beavers iteration [T9j for A with a spectrum 
contained in the cut complex plane (i.e. with no non-positive real eigenvalue) 

Yk+i = \{Y k + Z k l ), Y = A 

Z k+ i = 1-iZk + Y, 1 ), Z = I (40) 



2 



has the quadratic convergence pattern 



lim r fc = A 1/2 , lim Z k = A~ 1/2 



and is stable against round-off errors [18J. Throughout this appendix it is understood that the 
inverse is determined via a Gauss elimination (with pivoting whenever needed) and hence exact 
to machine precision. In particular, if one is interested only in A 1 / 2 (or A^ 1 ^ 2 ), the product 
form of the Denman-Beavers iteration [TBI 



Y k+ i = -n^ + Mfc- 1 ), Y = A 
Z k+ i = ^(I + M k l )Z k , Z = I 

M k+1 = l(M k + 2I + M k l ) , M = A (41) 

with the line for Z k+ i (or for Y k+ i) omitted and the quadratic convergence pattern from above 
(plus linifc^oo M k = I, and hence \\M k — I\\ < e as a natural exit criterion) proves superior. 
The reason is that only one inverse is required and the matrix to be inverted is, after a few 
steps, so close to the identity that no pivoting is needed (which is primarily useful when dealing 
with large matrices on massively parallel systems). In general, one will also use the scaling 
idea to accelerate this product form of the Denman-Beavers iteration (see [TS] for details), but 
in our case det(A) = 1 implies that no scaling is needed in this step. Note that (1401 14T|) are 
non-standard in the sense that the matrix A does not show up in the iteration step. 
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For the second ingredient, the logarithm of a matrix close to the identity, one option is to 
utilize a diagonal rational approximation of log(l — x), for instance one of 



r n (x) 
r 44 (x) 



-2x 
2-x 
—6x + 3x 2 
6 — 6x + x 2 

-60s + QOx 2 - llx 3 
60 - 90x + 36x 2 - 3x 3 

-2940x + UlOx 2 - 1820x 3 + 175a; 4 
2940 - 5880x + 3780x 2 - 840x 3 + 42a; 4 



with X = 1 — A. Here it is understood that the numerator and denominator are at least 
evaluated by means of the Horner scheme, but the larger m in r mm (.), the more it pays to use 
a sophisticated method [20]. An alternative representation, which is easier to implement, is 



bg(z) = 2{(z-i)(z+iy 1 + l -\{z-i){z+iy^ + l -[{z-i)(z+i)-^ + ...} (42) 

which is appropriate for Re(zj) > 0,2^0 (Vi). In our situation this condition is met after the 
first square root has been evaluated. An aside: If the inversion is exact, upon feeding (j42l with 
Z^ 1 one obtains exactly the negative of what one gets with Z, in spite of the truncation. 
Putting things together, one may either opt for the more elaborate algorithm of Ref. [18] 



y(°) = a 

for i = 1 : oo 

choose ki according to Theorem 5.1 of Ref. 
M = Y^~ l \ Y = yC*- 1 ) 
for k = : ki — 1 

Y k+l = Y k (I + M k 1 )/2, M k+l = (21 + M k + M^)/A 

end 

M« = M fci , Y^ = Y ka n = i 

if ||7 - y( fc )|| < \ and (7.5) of Ref. [18] satisfied with m n < 8 exit 

end 

form the Pade approximant X = r mnrrin (I — y( n )) 

rescale and correct through X = X2 n - Er=i( M(i) - / ) 2 ^ 1 ( 43 ) 

which uses a Pade approximant of maximum order [8 / 8] for the reduced logarithm, or one may 
decide to stay with the simpler algorithm (starting again with a fixed n min >2) 

L n . -i = 

let 5 , „ min _i be the result of n min — 1 nestings of (j4"Tj) . each one run to machine precision 
for n = n min : oo 

let S n be the result of (|4T|) with A = S n -i, again run to machine precision 

Rn = 2Efc=l 2kZi[(Sn — I){S n + J)- 1 ] 2 *"' -1 

L n = 2 n R n 

if \\L n — L n _i|| < e exit 
end (44) 
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which is similar in spirit to (37) for the matrix exponential. In the second case one may choose 
to increment n by more than one unit. In these implementations no property of SU(3) matrices 
has been exploited. Accordingly, if the argument is known to be a special unitary matrix, it is 
useful to check that the matrix logarithm is anti-hermitean and traceless modulo 2ni. 



A. 3 Higher matrix roots with iterative methods 

A problem in our approach of nesting n inverse scaling and squaring steps is that n—1 times 
we use the output of a square-root operation as input for the next one. In this form round-off 
errors will accumulate and the approximant of the 2 n -fold root will deviate from A 1 / 2 " by an 
error which grows exponentially with n. It then is natural to look for a post-iteration which 
renders the result of the root-cascade exact. We thus face the question how to compute the 
fourth (eighth, etc.) root of a matrix, if a relatively good initial guess is already known. 

Having a reasonable guess for A 1 ^, we have, via a simple inversion, also a guess for A~ l l p , 
and vice versa. The simplest strategy is to run the (stable) Newton postiteration 

X k+l = ^Ax k - -Xl +P A , X ~ A- 1 ^ (45) 
p p 

with the approximate inverse of the p-th root as a starting value, and then invert the result. 
Note that this recursion is expensive; it requires n + 2 = log 2 (p) + 2 matrix multiplications per 
step. Still, since in a post-iteration typically just 1 or 2 steps are needed, this is acceptable. 



A. 4 Post-iteration of the matrix logarithm 

Alternatively, one might stay with the uncorrected n-fold square root cascade, and correct, 
instead, the final logarithm. The standard approach for this is to use the Newton iteration 

X n+1 = X n -I+ l -{e~ x -A + Ae~ x -) (46) 

where we have opted for the symmetric version. Again, the individual step is expensive (a new 
exponential is needed in each step), but for a post-iteration this is acceptable. 



A. 5 Matrix logarithm for unitary arguments 

For unitary argument the principal matrix logarithm is purely anti-hermitean, log(U) = iH with 
H = W and spec(if) G ]— 7T,7r[. Based on the experience with (and some of the ingredients 
from) the general algorithm (44) it is straight-forward to devise an algorithm tailored to yield 
the matrix logarithm of a unitary argument U |21j . 

At the beginning one has cos(H) and isin(if), defined as the hermitean and anti-hermitean 
parts of U. One may think of accumulating knowledge about subsequent half-angle sine and 
cosine functions, such that in the end one has sm(H/2 n ) / cos(H/2 n ) = tan(if/2 n ) and by means 
of an arctan-representation which is valid for small arguments one gets H/2 n and thus H. 

In fact, two tricks ease our task. The first one is the identity 

tan(#/4) = sm(H) \l + cos(H) + y/2[l + cos(#)] 1/2 l ^ (47) 
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which allows us to directly jump to the quarter-angle operator, at the expense of a single 
Denman-Beavers iteration (j4ip . The second one is again based on if/4 having a spectrum in 
the open interval ] — 7r/4, 7r/4[; now Z = tan(_ff/4) is just in the range of convergence of 

arctan(Z) = Z - -Z* + -Z 5 - ... = £ i^-L-Z 2 ** 1 (||Z|| < 1) . (48) 

As a result, no nested square- root scheme is needed, and the only exit criterion is concerned 
with the absolutely convergent series (j4"8l . In practice it proves useful to monitor the numerical 
convergence of the series, and to post-iterate with f)46p if needed. 



A. 6 Unitary projection with iterative methods 

The ability to calculate A~ x l 2 is also useful for the polar decomposition that is needed in the 
projection step of the traditional APE or HYP smearing procedure. Given a non-unitary V^(x) 
[the linear combination inside the wavy bracket in (j2J)], one proceeds in two steps. Usually, one 
thinks of first projecting to U(3), and then adjusting the determinant 

W^x) = V^x)[V^x)V^x)}- 1/2 , Up{x) = l^(x)/det 1/3 [Wyx)] (49) 

where W^(x) is the unitary part of V^(x), which is unique if V^(x) is nonsingular. In practice 
it is often a better choice to reverse the order, i.e. to compute 

W^(x) = U M (x)/det 1/3 [^(x)] , U^x) = W^iW^W^x)}- 1 ' 2 (50) 

to speed up the iterative polar decomposition, if the latter is done without rescaling. 

One way to compute the unitary part is via the quadratically convergent Newton iteration 

X k+1 = ^{ lk X k + ^\xlr l ) , X = A (51) 

with linifc^ooXfc = A(A^A)~ 1/2 , where 7fc = | det(X fc )r 1/2 , lk = (a min (X fc )a ma x(A fc ))~ 1/2 , or 
7/c = (HX^II/llXfcll) 1 / 2 (with the Frobenius norm) represent typical choices for the scaling 
parameter. Upon first adjusting the determinant to 1, i.e. upon using A = V^{x)/ det 1//3 [V^(x)], 
the choice 7& = 1 (V/c) becomes quite efficient. 

Another option is to use the product form of the Denman-Beavers iteration (|41|) for A^ 1 ^ 2 
with A = V^{x)V ll {x) I 'det 1/3 [Vj(x)V^(a;)] and leftmultiply the result with V^x)/ det 1/3 (^(;c)). 
Again, due to det(A) = l, this is efficient without further rescaling. 

Finally, it is worth pointing out that, for arbitrary given V, the matrix P U{N) = V(V*V) ~ 1/2 
maximizes RetiiV^U) over all unitary U, but the SU (N) matrix Pdet=iPu(N)V — Pu(N)Pdet=iV 
does not maximize Re tr(V*U) over all special unitary matrices. This is of interest in the context 
of a direct overrelaxation in SU(N), as discussed in [22J. 

A. 7 Eigenvalue based logarithm of a unitary matrix 

For an arbitrary 3x3 matrix A, the Vieta theorem for the characteristic polynomial yields 

Ai + A 2 + A 3 = Tr(A) , AxA 2 + A X A 3 + A 2 A 3 = det(A)Tr(A" 1 ) , AiA 2 A 3 = det(A) (52) 



23 



and for a unitary matrix this simplifies to 

Ai + A 2 + A 3 = Tr(U) , AiA 2 + AxA 3 + A 2 A 3 = Tr(Uf , AiA 2 A 3 = det(ET) (53) 

where the star denotes complex conjugation. 

As a result, the eigenvalue based method for computing the matrix logarithm of a unitary 
3x3 matrix U starts with the coefficients of the characteristic polynomial x 3 + ax 2 + bx + c = 0, 
that is with a = —Tr(U), b = Tr(U)* = —a*, c = — det(U). Next, form the complex quantities 
Q = a 2 /9-b/3 and R = a 3 /27-ab/6 + c/2 = aQ/3-ab/18 + c/2. If Q and R are both real and 
R 2 < Q 3 , then the cubic equation has three real roots. In the SU(3) case, this means either 
Ai = A 2 = A 3 = 1, or one eigenvalue +1 and two —1. Otherwise, form A = —[R + ^JR 2 — Q' A \ 1//3 
[with the complex square-root chosen such that Re(i?* V R 2 — Q 3 ) > 0] and B = Q/A. With 
these at hand, the solution (Cardano, 1545) 

Xl = (A+B)-^ , x 2 = -l(A+B)-^+i^-(A-B) , x 3 = - 1 -(A+B)-^-i^(A-B) (54) 

is written in such a way as to minimize round-off errors [23J . 

Having the eigenvalues, the pertinent eigenvectors may be found through solving a linear 
system; the eigenvector v^' (i = 1..3) is simply the solution of (U — \il)v® = 0. Building on the 
fact that S = U — Aj/ is singular, a convenient choice for the components of v ^ is the minors 
of, e.g., the first row: = S' 22 S' 33 - S 2 3S 32 , f 2 = S23S31 - S 2 iS 33} u| = S 21 S 32 - S 22 S 3 i. 
Here, it is assumed that the second and third row are linearly independent [but, of course, any 
row is in the span of the other two] . Whenever the maximum degeneracy is two-fold, there is 
at least one such choice which yields a non-zero eigenvector. 

With the normalized eigenvectors at hand, the principal logarithm of U would be \og(U) = 
J2i=i l°g(Aj)t> V'v The virtue of the eigenvalue based approach is that it gives us the means 
to identify the "troublesome" Aj whenever the principal logarithm has a non-zero trace. Assume 
the trace is +27ri; in that case log(Ai) + log(A 2 ) + log(A 3 ) = 27ri. Identify the log(Aj) with the 
largest imaginary part [they all lie on the imaginary axis]. Set /ij = log(Aj) — 27ri for that i and 
/ij = log(Aj) for the other two. The trace-free logarithm is then tflog(f/) = Y^=i (iiV^v® t, 
with an obvious generalization to the case where the trace of the principal logarithm is — 27ri. 
In mathematical terms this procedure is equivalent to shifting the cut of the logarithm such 
that one of the eigenvalues lies on the second Riemann sheet. 

Finally, note that this procedure is in marked contrast to the naive approach of subtracting 
one third of the trace from the principal logarithm, as is done in ([HI ED- This naive recipe 
amounts to shifting each log(Aj) by ±27ri/3. 

A. 8 Appendix summary 

While faster algorithms exist to compute the matrix exponential, the principal logarithm and 
the projection to SU(3), the iterative methods described in this appendix yield numerically 
stable results with 64-bit machine precision after 0{< 10) steps. Moreover, the possibility to 
warrant exact HMC reversibility, even if one opts for a lower precision, seems attractive. For 
the non-principal logarithm used in (jSj), the eigenvalue based method seems most convenient. 
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B Tricks for EO-preconditioned BCG75 solver 



In (full or quenched) QCD one solves the Dirac equation Dx = b for a given right-hand side b. 
A key feature of the clover D is that it couples nearest neighbors, apart from a mass and a clover 
contribution which do not hop at all. Accordingly, upon labeling the sites in a checkerboard 
("even-odd") fashion, the problem takes a block-offdiagonal form, except for the generalized 
mass contribution which remains site-diagonal. Hence D can be block-L[/-factorized [21] 

D <t £) = ^ = (i>J> ; - ?)(o °)(o ^ 

where the Schur complements L and U are easily inverted (by flipping the sign of the block- 
offdiagonal piece) and the block-diagonalized D takes the form 

f) - ( 1 \ ( Dcc Dco \ ( 1 - D ec D co \ _ ( D ee \ 

V —D D' 1 1 )\ D oe D ao J V 1 )~\ D -D D^D )' 1 } 

As a result, the original problem is split into three parts: (i) left-multiply the source with L^ 1 , 
that is define the new source c with c e = b e and c Q = b — D oe Dz}b e ; (ii) solve the non-trivial 
block problem D 00 y = c Q and the trivial one y e = Dz}c e ; (Hi) left-multiply the solution with 
U" 1 , that is obtain x from y through x Q = y a and x e = y e — Dz}D eo y . 

It is worth pointing out two peculiarities of the clover case. First, contrary to the Wilson 
case, where D OQ and D ee are constants, they now happen to be site diagonal, that is they are 
composed of V little 12 x 12 matrices [for SU(3) gauge group, with V the number of sites]. In 
the chiral representation, the latter are further reduced to two 6x6 matrices. Still, this has a 
severe impact on the memory requirement. In order to have a fast forward-application routine, 
it is customary to allocate an array which contains all 6 x 6 matrices needed. This amounts to 
a vector of 72V complex entries, to be compared to the 36V complex entries of the gauge field 
and the QV complex entries of a half-vector. 

Second, the non-trivial problem in step (ii) may be traded for the more symmetric version 
(1 — D~^D oe D~J-D eo )y = D~qC q with a slightly lowered condition number [251 126]. On the 
other hand, the reduced operator D Te d = ^(D 00 — D oe Dz}D eo ) is 75-hermitean; thus the original 
version can be mapped into H re ^y Q = \"i^c with the hermitean, indefinite H Ted = 75-D red . In 
addition, it implies that the operator D syra = 2(1 — D~^D oe Dz}D eo ) enjoys a 7 5 D 00 -hermiticity, 
that is Dl ym = 2(1 - 7 5 D oe L> cc 1 D co D 00 1 7 5 ) = -f 5 D 00 D sym D-^ 5 . Similarly, the other symmetric 
operator, that follows from D sym by eo-flipping the suffixes, is 75D ee -hermitean (and has an 
identical spectrum). In all these considerations it is assumed that the little 6x6 matrices are 
inverted beforehand, so that Dz} (and, if needed, D~^) are known. 

The operator D Te< i being 75-hermitean, it is tempting to use the BCG75-algorithm [27] to 
solve the reduced system in step (ii) above. As is well known, with finite-precision arithmetics 
BCG75 is prone to suffer from instabilities, and this holds true after EO-preconditioning, too. 
The instability is mainly due to the indefinite scalar products [with a 75 between the two vectors] 
genuine to this algorithm. Such an object may happen to be quite small in absolute magnitude, 
while the individual terms in the sum are not. In practice it is, nonetheless, observed that the 
algorithm performs quite convincingly, if the following three "tricks" are used: 

1. perform the summation in the (.,7 5 .)-type scalar products in quadruple precision [the 
products to be summed over are still computed in standard double precision] 
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2. recompute the true residue much more frequently than one would do in an algorithm 
without indefinite scalar products, e.g. every 10-th step instead of every 100-th step 



3. keep track of the vector which gave, so far, the smallest true residue, and restart, if certain 
conditions are met, from this vector. 

Evidently, devising a good set of conditions is important. In some of the simulations presented 
in this article, BCG75 was restarted if the norm of the actual residue was three orders of 
magnitude above the best residue encountered so far and, at the same time, the last restart 
would date back at least 500 steps. It turned out that a better condition is to demand that the 
minimum residue would persist for 500 steps and the last restart would date back at least 1500 
steps. It is advisable to code the routine such that - in case the required residual norm would 
not be reached within the maximum number of steps - it would return the best approximation 
encountered and another algorithm (e.g. CGNE) would take over. In the simulations presented 
in this article this has never happened. 

C Tensor calculus and chain rule for matrix functions 

In this appendix a convenient notation is introduced whereupon the chain rule for matrix valued 
functions looks like the usual chain rule for scalar functions. The material is mostly taken from 
[28J, but restricted to the case of square matrices, since this is all that we need. 

Considering a function Y(X) with X and Y both NxN matrices, the derivative BY/dX is 
conveniently represented as a N 2 x iV 2 matrix, since it is supposed to contain information on 
how each element of Y depends on each element of X. In the mathematical literature three 
conventions regarding the ordering of its entries may be found, but only one of them allows for 
a simple-looking generalization of the chain rule for scalar functions. 

For definiteness, let us recall some standard mathematical notation. A matrix A is repre- 
sented as A = {dijel) with the usual summation convention for repeated indices within a 
bracket. Here, e\ is an NxN matrix with zeros everywhere except for the place in row i and 
column j. The Kronecker product of A and B = {bkie l k ) is then a N 2 xN 2 matrix 



which is constructed by blowing up A and plugging B into every partition (multiplied with the 
element of the former A). Accordingly, the product a^h^i is found in position (p, q) of the new 
matrix with p = (i — l)N + k and q = (j — l)N + I. In fact, this is the reason for the notation 
used in (I57p . where only information on the ordering of covariant indices among themselves 
(and ditto for contravariant indices) is retained, but no information on the relative ordering of 
covariant and contravariant indices. One is invited to read (ik) as the new row index and (jl) 
as the new column index, with the breakdown into row and column number as given above. 
When traveling through the new matrix, i and j move slowly, while k and I move rapidly. Upon 
tensoring (157)) with a matrix C = (c mn e^J the old layout is again blown up and we arrive at 



with the row and column multiindices (ikm) and (jln) translating intop = ((i—l)N+k—l)N+m 
and q = ((j — l)N + l — l)N + n, respectively. The important point is that these Kronecker 



A® B = (KM4) 



(57) 



A ® B ® C = ({aijbkiCmnjel^) 
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products form an associative algebra, that is 



(A <g> B <g> C)(D <g> E <g> F) = (AD ®BE® CF) . 



(59) 



In the first definition the derivative of Y with respect to X is a partitioned matrix [dY/dxy] 
whose partition is put together by taking the derivatives dy^i/dx^ for all entries of Y. In 
other words, the layout is similar to that of X®Y, that is 



dY 



dm c ji 



(60) 



In the second definition the derivative of Y with respect to X is a partitioned matrix 
[dyki/dX] whose (k,l) partition is put together by taking the derivatives dy^i/dxij for all 
entries of X. In other words, the layout is similar to that of Y®X, that is 



dy, 



dX 



'dyki c ij S 



(61) 



In the third definition the derivative of Y with respect to X is obtained by first reshaping 
both matrices into N 2 x 1 column vectors, denoted X c and Y c , respectively. In this step the 
second column is appended to the first one, and so on. Then the derivative of the p-th element 
of X c with respect to the g-th element of Y c is stored in position (p, q) . In other words, the 
layout is similar to that of Y C ®(X C )' = ({afy-j/fcz}^), with ' denoting the transposition, that is 



dY c 
dX~ c 



Ik 



(62) 



When comparing these definitions one notices that the first two differ only by the order 
among the covariant indices and among the contravariant indices. Therefore, the two can be 
mapped into each other by means of transpositions. By contrast, the third definition differs 
from the previous two in a more profound manner; for this change covariant indices need to be 
converted into contravariant indices and vice versa. Another notation for this third definition, 
which sometimes occurs in the literature, is dvecY/dvec'X = dvecY/d(vecX)' . 

We start with having a look into the product rule with this third definition. Let Y(X) and 
Z(X) be two matrix valued functions which depend on X. According to the standard rules, 
the derivative of the product W = YZ = (wfc n e^) with respect to X is 



dW c 
~dX~ c 



dw 



dxij 



kn ji 
^nk 



dz mn ji 



(63) 



where w^ n = (ykoZon) has been used. On the other hand, the derivative of Y with respect to 
X is the N 2 xN 2 matrix (16"2"1) . which cannot be multiplied with Z, unless the latter is tensored 
with the identity / to have the right dimension. However, standard algebra gives 



-gyc- 




_dX c _ 


-{ 


~dZ c ~ 




dX c 


"( 



qk\ 



'dy, 



ki 



0X{j 



-Ik 



5 pqykiepk^j SqJim (j^ e nn 



Zml&pk&i 



^pnUkm^- 




dyki ji N 



OXij 



Vkr, 



dz. 



dx 



van n 



nk 



'.I 
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and therefore the product rule for matrix derivatives is seen to take the form 



d(YZ) c 
dX c 



[Z 1 <g> I) 



dY c 
dX~ c 



+ (I ® Y) 



dZ c 
dX~ c 



(64) 



In fact, upon generalizing the size of X to M x N, and that of Y and Z to P x Q and Qx R, 
respectively, the l.h.s. of ([Ml) is an PRxMN matrix. At the same time the first term on the 
r.h.s. is the product of a PR x PQ matrix (with I = Ip) and a PQ x MN matrix. And the 
second term is the product of a PRxQR matrix (with I = Ip) and a QRxMN matrix. Hence 
either term, and the r.h.s. in total, is of size PRx MN, in perfect agreement with the l.h.s. 

The main reason why it pays to choose the layout (1621) is that this definition allows for a 
neat extension of the standard chain rule. Let Z = Z(Y) be a matrix valued function of Y and 
Y = Y(X) a matrix valued function of X. With similar manipulations as above it follows that 



dZ c 
dX~ c 



dZ c 



dY c 
dX~ c 



(65) 



where the l.h.s. is a QRxMN matrix and the r.h.s. is a QRxPQ times a PQxMN matrix. 
Therefore, also this rule remains valid for arbitrarily shaped source and target matrices. Note 
that, with any of the other definitions of the matrix derivative, one would be forced to introduce 
a clumsy "star product" which does not obey the usual rules of matrix multiplication. 

It is worth noticing that the similarity to the chain rule for scalar functions is somehow 
limited, since the derivative of exp(.) and log(.) is not given by exp(.) and respectively 
[which would be matrices of inappropriate size]. Due to the infinite series in the definition of 
exp(X), the (k, I) element of the exponential depends on each Xij. Accordingly, <9exp(X) c /<9X c 
is a full N 2 xN 2 matrix, and a similar statement holds true for <91og(X) c /<9X c . Still, with 



d(X r 



d(x n - 1 ■ xy 



(X'®I) 



d(X 



n— l\c 



n-1 



dX c dX c v ' dX c 

for I = In it follows that the derivative of the n-th power can be written in compact form 



(66) 



d{X n ) 
dX c 



n-1 



n-1 



(67) 



and similarly the derivative of the inverse of a matrix Y = Y{X) is given by 



d(Y 



-l\c 



(Y^'fgjY- 1 ) 



_! dY c 



dX c ~~ ' dX c 

Thanks to the exponential being given by a globally convergent power series, it follows that 

<9exp(X) c 



(68) 



OO 71— 1 



<9X C 



71=1 U - 1=0 



(69) 



and the chain rule (|65|) says that the inverse of it, if it exists, is d\og{Y) c / dY c at Y — exp(X). 

With the formalism presented in this appendix the HMC force for an arbitrary fat-link 
action can be worked out in a way which relies only on the standard matrix multiplication law 
and may, as a result, benefit from optimized linear algebra subroutines. 
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